
Include metadata in numpy/polars cache fingerprints to prevent collisions #1616

Merged
skrawcz merged 1 commit into main from stefan/fix-cache-fingerprinting-collisions on Jun 1, 2026

@skrawcz skrawcz commented May 29, 2026

Contributor

Summary

Fix hash collisions in the caching subsystem's fingerprinting for numpy arrays and polars DataFrames.

Problem

  • hash_numpy_array used only obj.tobytes(), which discards shape and dtype. Arrays with identical raw bytes but different shapes (e.g., shape=(6,) vs shape=(2,3)) or different dtypes (e.g., float32(1.0) vs int32(1065353216)) produced identical cache keys.

  • hash_polars_dataframe used only obj.hash_rows(), which discards column names. DataFrames with identical cell values but different schemas produced identical cache keys.

Both could cause the cache to silently return incorrect results from a previous execution.
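The collisions described above can be reproduced with a few lines. This is a minimal sketch rather than the project's code: `bytes_only_fingerprint` is a hypothetical stand-in for a hash that looks only at `tobytes()`, and the choice of sha224 is an assumption made for illustration.

```python
import hashlib

import numpy as np


def bytes_only_fingerprint(arr: np.ndarray) -> str:
    """Hypothetical stand-in: hash the raw buffer and nothing else."""
    return hashlib.sha224(arr.tobytes()).hexdigest()


flat = np.arange(6, dtype=np.int32)  # shape (6,)
grid = flat.reshape(2, 3)            # shape (2, 3), same buffer contents

as_float = np.array([1.0], dtype=np.float32)
as_int = as_float.view(np.int32)     # reinterpret the same 4 bytes as an int32

# Different shapes, identical fingerprint.
shape_collision = bytes_only_fingerprint(flat) == bytes_only_fingerprint(grid)
# Different dtypes, identical fingerprint.
dtype_collision = bytes_only_fingerprint(as_float) == bytes_only_fingerprint(as_int)
```

Both `shape_collision` and `dtype_collision` evaluate to `True`, because the raw buffers are byte-for-byte identical in each pair.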

Fix

  • hash_numpy_array: prepend f"{obj.shape}:{obj.dtype}" to the bytes before hashing
  • hash_polars_dataframe: include column names and dtypes (schema) alongside row hashes
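The two changes listed above might look roughly like the following. This is a sketch under stated assumptions, not the merged implementation: `fingerprint_array` and `fingerprint_frame` are hypothetical names, sha224 is assumed as the digest, and plain Python values stand in for a polars DataFrame so the example runs without polars installed. With polars, the schema and row hashes would be derived from the frame itself.

```python
import hashlib

import numpy as np


def fingerprint_array(arr: np.ndarray) -> str:
    """Hypothetical helper: hash shape and dtype ahead of the raw bytes."""
    header = f"{arr.shape}:{arr.dtype}".encode()
    return hashlib.sha224(header + arr.tobytes()).hexdigest()


def fingerprint_frame(schema: dict[str, str], row_hashes: list[int]) -> str:
    """Hypothetical helper: hash column names and dtypes, then the row hashes.

    `schema` maps column name to dtype name; `row_hashes` holds one unsigned
    64-bit integer per row. Both are stand-ins for values taken from a frame.
    """
    digest = hashlib.sha224()
    # repr() of the (name, dtype) pairs keeps column boundaries unambiguous.
    digest.update(repr(list(schema.items())).encode())
    for row_hash in row_hashes:
        digest.update(row_hash.to_bytes(8, "little"))
    return digest.hexdigest()


flat = np.arange(6, dtype=np.int32)
shapes_differ = fingerprint_array(flat) != fingerprint_array(flat.reshape(2, 3))
names_differ = (
    fingerprint_frame({"a": "Int64"}, [11, 22])
    != fingerprint_frame({"b": "Int64"}, [11, 22])
)
```

Because the metadata is hashed together with the payload, inputs that differ only in shape, dtype, or column names no longer share a key, while identical inputs still hash identically.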

Backwards compatibility

This changes the hash output for numpy arrays and polars DataFrames. Existing cache entries will miss (a different hash triggers recomputation) rather than return incorrect results. Users will see a one-time recomputation after upgrading, but no manual cache clearing is needed.

Tests

Added tests verifying:

  • Different shapes produce different hashes
  • Different dtypes with same bit pattern produce different hashes
  • Different column names produce different hashes
  • Identical data still produces identical hashes
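The four checks listed above could be written along these lines. The tests added in the PR are not shown on this page, so this is only an illustrative pytest-style sketch: `fingerprint` and `array_fingerprint` are hypothetical helpers, sha224 is an assumed digest, and the column-name case uses a plain string header in place of a real DataFrame schema.

```python
import hashlib

import numpy as np


def fingerprint(metadata: str, payload: bytes) -> str:
    """Hypothetical helper: hash a metadata header followed by the payload."""
    return hashlib.sha224(metadata.encode() + b"\x00" + payload).hexdigest()


def array_fingerprint(arr: np.ndarray) -> str:
    """Hypothetical helper: use shape and dtype as the metadata header."""
    return fingerprint(f"{arr.shape}:{arr.dtype}", arr.tobytes())


def test_different_shapes_differ():
    flat = np.arange(6, dtype=np.int32)
    assert array_fingerprint(flat) != array_fingerprint(flat.reshape(2, 3))


def test_same_bits_different_dtype_differ():
    as_float = np.array([1.0], dtype=np.float32)
    assert array_fingerprint(as_float) != array_fingerprint(as_float.view(np.int32))


def test_different_column_names_differ():
    # The header strings stand in for a schema; the cell bytes are identical.
    cells = np.array([1, 2, 3], dtype=np.int64).tobytes()
    assert fingerprint("a:Int64", cells) != fingerprint("b:Int64", cells)


def test_identical_data_matches():
    first = np.arange(6, dtype=np.int32).reshape(2, 3)
    second = first.copy()
    assert array_fingerprint(first) == array_fingerprint(second)
```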

Reported-by: Dem0

…ions

hash_numpy_array now includes shape and dtype in the hash, preventing
collisions between arrays with identical raw bytes but different
semantics (e.g., shape=(6,) vs shape=(2,3)).

hash_polars_dataframe now includes column names and dtypes (schema)
in the hash, preventing collisions between DataFrames with identical
cell values but different column schemas.

Existing caches will simply miss (different hash = recomputation),
not produce incorrect results.

Reported-by: Dem0

@elijahbenizzy elijahbenizzy left a comment

Contributor


And closing out my duplicate

@Dev-iL Dev-iL left a comment

Collaborator


If we're redoing hashes - perhaps it might be worth switching from sha224/md5 to xxh3?

@ArnavBalyan ArnavBalyan left a comment

Member


Lgtm ty for the change

skrawcz commented Jun 1, 2026

Contributor Author

If we're redoing hashes - perhaps it might be worth switching from sha224/md5 to xxh3?

separate PR? would have to validate license, etc.

@skrawcz skrawcz merged commit 7644a30 into main Jun 1, 2026
6 checks passed
@skrawcz skrawcz deleted the stefan/fix-cache-fingerprinting-collisions branch June 1, 2026 06:09

4 participants